TITLE OF THE INVENTION

COARSE REPRESENTATION OF VISUAL OBJECT'S SHAPE 
FOR SEARCH/QUERY/FILTERING APPLICATIONS 

CROSS REFERENCE TO RELATED APPLICATIONS 

This is a continuation of provisional U.S. Patent Application Serial 
No. 60/118,386 filed February 1, 1999, now abandoned. 

BACKGROUND OF THE INVENTION 

The present invention relates to video data processing, and more
particularly to a coarse representation of a visual object's shape for
search/query/filtering applications.

With the success of the Internet and of picture and video coding
standards such as JPEG, MPEG-1 and MPEG-2, more and more audio-visual
information is available in digital form. Before one can use any such
information, however, it first has to be located. Searching for textual 
information is an established technology. Many text-based search engines 
are available on the World Wide Web to search text documents. Searching 
is not yet possible for audio-visual content, since no generally recognized 
description of this material exists. MPEG-7 is intended to standardize the
description of such content. This description is intended to be useful in
performing searches at a very high level or at a low level. At a high level the
search may be to locate "a person wearing a white shirt walking behind a
person wearing a red sweater". At lower levels for still images one may use
characteristics like color, texture and information about the shape of 
objects in that picture. The high level queries may be mapped to the low 
level primitive queries to perform the search. 

Visual object searches are useful in content creation, such as to
locate from an archive the footage from a particular event, e.g. a tanker on
fire, clips containing a particular public figure, etc. Also the number of
digital broadcast channels is increasing every day. One search/filtering
application is to be able to select the broadcast channel (radio or TV) that
is potentially interesting.

What is desired is a descriptor that may be automatically or semi- 
automatically extracted from still images/key images of video and used in 
searches. 



BRIEF SUMMARY OF THE INVENTION

Accordingly the present invention provides a coarse representation 
of a visual object's shape for search/query/filtering applications. 

The objects, advantages and other novel features of the present 
invention are apparent from the following detailed description in view of 
the appended claims and attached drawing.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

Fig. 1 is an illustrative view of a visual object within a digital
image.

Fig. 2 is an illustrative view of the elements of a feature vector
according to the present invention.

Fig. 3 is a block diagram view of a feature vector extraction process
according to the present invention.

Fig. 4 is a block diagram view of a search engine based upon coarse
shape feature vectors.

DETAILED DESCRIPTION OF THE INVENTION 

A coarse representation of a visual object's shape may be used for
searching based on the shape of the object. This representation is easy to
compute, but answers a variety of queries that will be described later.
However, this simple approach may not provide a very high quality shape
representation, either in 2-D or 3-D. The following method may be used
for visual objects in still images or in video.

As shown in Fig. 1, in a coarse representation of shape each
semantic object or its sub-portions may be represented by a binding box (a
rectangle) in the image. A binding box of a visual object is the tightest
rectangle that fully encompasses that visual object in an image.

The parameters needed to represent the binding box are:

    | TopLeftCorner_h     |
    | TopLeftCorner_v     |
    | BoxWidth            |
    | BoxHeight           |
    | FractionalOccupancy |

As shown in Fig. 2 the position components TopLeftCorner_h and
TopLeftCorner_v of the binding box are defined as offsets in the horizontal and
vertical directions with respect to the origin of the picture, which is
nominally at the top-left corner of the image. The FractionalOccupancy is
a number between 0 and 1. This is the fraction of samples in the binding
box that belong to the object being described. In order to describe
TopLeftCorner_h, TopLeftCorner_v, BoxWidth and BoxHeight, a normalized
coordinate system is used. In this system the height of the image when
displayed is normalized to 1.0. Subsequently, the display width is
measured in units of display height. As an example, for a 320x240 image
that uses square pixels for display, the height of 240 pixels is mapped to
1.0, and the width of 320 is mapped to 1.333 (320/240).

An example feature vector is:

    | 0.43 |
    | 0.51 |
    | 0.22 |
    | 0.25 |
    | 0.83 |

Low level queries served by this feature vector include:

1. Find the visual objects that have a particular aspect ratio (ratio of 
height to width). 

2. Find the visual objects that are at least x% (a given percentage) of 
the picture size. 

3. Find the visual objects that are at most x% (a given percentage) of 
the picture size. 

4. Find the visual objects that are positioned near (x,y) (a particular 
coordinate) location in the picture. 

5. Find the visual objects that are at least x% (a given percentage) 
dense. 

6. Find the visual objects that are at most x% (a given percentage) 
dense. 

7. Find the visual objects that have at least "y" units height. 

8. Find the visual objects that have at most "y" units height. 

9. Find the visual objects that have at least "x" units width. 

10. Find the visual objects that have at most "x" units width. 

11. Estimate the trajectory of a particular visual object over time in a
given video.
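
Several of the queries listed above reduce to simple arithmetic on the five elements of the feature vector. The following is a minimal sketch of that arithmetic, not the method of the invention. The names CoarseShape, aspectRatio, pictureFraction, distanceFromPoint and isAtLeastDense are hypothetical. Two interpretations are assumptions of the sketch: "picture size" is taken to mean the area of the binding box relative to the picture area, and "positioned near (x,y)" is measured from the centre of the binding box.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the five feature vector elements,
// all expressed in the normalized units described in the text.
struct CoarseShape {
    double topLeftH;             // TopLeftCorner_h
    double topLeftV;             // TopLeftCorner_v
    double boxWidth;             // BoxWidth
    double boxHeight;            // BoxHeight
    double fractionalOccupancy;  // FractionalOccupancy, between 0 and 1
};

// Query 1: aspect ratio, defined in the text as height over width.
double aspectRatio(const CoarseShape& s) {
    assert(s.boxWidth > 0.0);
    return s.boxHeight / s.boxWidth;
}

// Queries 2 and 3 (assumed interpretation): box area as a fraction of the
// picture area. The normalized picture height is 1.0, so the picture area
// equals the normalized picture width (e.g. 4/3 for a 320x240 display).
double pictureFraction(const CoarseShape& s, double pictureWidth) {
    assert(pictureWidth > 0.0);
    return (s.boxWidth * s.boxHeight) / (pictureWidth * 1.0);
}

// Query 4 (assumed interpretation): distance from the box centre to (x, y).
double distanceFromPoint(const CoarseShape& s, double x, double y) {
    const double cx = s.topLeftH + 0.5 * s.boxWidth;
    const double cy = s.topLeftV + 0.5 * s.boxHeight;
    return std::hypot(cx - x, cy - y);
}

// Query 5: is the object at least `percent` percent dense?
bool isAtLeastDense(const CoarseShape& s, double percent) {
    return s.fractionalOccupancy * 100.0 >= percent;
}
```

Applied to the example feature vector (0.43, 0.51, 0.22, 0.25, 0.83), aspectRatio returns 0.25/0.22, and the "at most" variants of the queries follow by reversing the comparison.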



Overall extraction of a coarse representation of a visual object's
shape is shown in Fig. 3. The steps involved are (1) segmentation, (2)
extraction of the bitmap of the object of interest, and finally (3) estimation of
the binding box. In this figure, the segmentation process may be
automatic, semi-automatic, or manual. The segmentation map consists of
a segmentation label at each pixel. The set of pixels having a particular
segmentation label belongs to a distinct visual object. Thus, the second
stage merely creates a binary map, with values "valid" (true, 1, or 255)
wherever the segmentation label equals the objectID of interest, and "invalid"
(false or 0) elsewhere. Identification of the largest connected region in the
bitmap is covered in co-pending provisional U.S. Patent Application
Serial No. 60/118, The binding box estimation procedure gets as input the
bitmap indicating the validity of each pixel and the display aspect ratio
appropriate for the picture.
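
The second stage, extraction of the bitmap, can be illustrated with a short sketch. It assumes the segmentation map is held as a rectangular array of integer labels. The type aliases LabelMap and Bitmap and the function extractObjectBitmap are hypothetical names chosen for illustration, and the sketch uses 1 as the "valid" value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types: one integer label per pixel, one validity flag per pixel.
using LabelMap = std::vector<std::vector<int>>;
using Bitmap = std::vector<std::vector<unsigned char>>;

// Marks as valid (1) every pixel whose segmentation label equals objectId
// and as invalid (0) every other pixel.
Bitmap extractObjectBitmap(const LabelMap& labels, int objectId) {
    Bitmap bitmap(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        // The sketch assumes a rectangular map: all rows have equal length.
        assert(labels[i].size() == labels[0].size());
        bitmap[i].resize(labels[i].size(), 0);
        for (std::size_t j = 0; j < labels[i].size(); ++j) {
            if (labels[i][j] == objectId) bitmap[i][j] = 1;
        }
    }
    return bitmap;
}
```

The resulting bitmap has the same dimensions as the segmentation map and can be passed, together with the display aspect ratio, to the binding box estimation.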

The process of estimating the binding box itself may be broken 
down as: 

1. Estimating in pixel units the TopLeftCorner_h, TopLeftCorner_v,
BoxWidth, BoxHeight, and FractionalOccupancy.

2. Normalizing the units. 

The estimation of TopLeftCorner_h, TopLeftCorner_v, BoxWidth,
BoxHeight, and FractionalOccupancy is performed by the following C++
code segment. The inputBitmap is a 2-D array that indicates the validity of
each sample (i.e. whether or not it belongs to the object of interest).

    // inputBitmap[i][j] is nonzero where the sample at row i, column j
    // belongs to the object of interest; imageHeight and imageWidth are
    // the picture dimensions in pixels.
    int botRightv, botRighth;
    int i, j, nr, nc, nSamples = 0;
    bool valid;

    occupancy = 0;
    boxWidth = boxHeight = topLeftv = 0;
    topLefth = botRightv = botRighth = 0;
    nr = imageHeight;
    nc = imageWidth;

    // topLeftv: first row, scanning downward, that holds a valid sample
    valid = false;
    for (i = 0; i < nr; i++) {
        for (j = 0; j < nc; j++) {
            if (inputBitmap[i][j]) {
                valid = true;
                break;
            }
        }
        if (valid) break;
    }
    topLeftv = i;

    // topLefth: first column, scanning rightward, that holds a valid sample
    valid = false;
    for (j = 0; j < nc; j++) {
        for (i = 0; i < nr; i++) {
            if (inputBitmap[i][j]) {
                valid = true;
                break;
            }
        }
        if (valid) break;
    }
    topLefth = j;

    // botRightv: last row, scanning upward, that holds a valid sample
    valid = false;
    for (i = nr - 1; i >= 0; i--) {
        for (j = 0; j < nc; j++) {
            if (inputBitmap[i][j]) {
                valid = true;
                break;
            }
        }
        if (valid) break;
    }
    botRightv = i;

    // botRighth: last column, scanning leftward, that holds a valid sample
    valid = false;
    for (j = nc - 1; j >= 0; j--) {
        for (i = 0; i < nr; i++) {
            if (inputBitmap[i][j]) {
                valid = true;
                break;
            }
        }
        if (valid) break;
    }
    botRighth = j;

    // Count the valid samples inside the box
    for (i = topLeftv; i <= botRightv; i++)
        for (j = topLefth; j <= botRighth; j++)
            if (inputBitmap[i][j]) nSamples++;

    if (nSamples > 0) {
        boxHeight = botRightv - topLeftv + 1;
        boxWidth = botRighth - topLefth + 1;
        occupancy = double(nSamples) / double(boxHeight * boxWidth);
    }

Display aspect ratio (DAR) is the ratio of the height of the displayed 
picture to the width of the displayed picture, say in meters. For example, 
it is 3/4 for conventional TV, 9/16 for HDTV. Given the estimated results 
(from above) that are in pixel units, the following relations may be used to 
perform normalization of the units. 

NormBoxHeight = PixelBoxHeight / PixelPictureHeight

NormBoxWidth = PixelBoxWidth / (PixelPictureWidth * DAR)

NormTopLeftCorner_v = PixelTopLeftCorner_v / PixelPictureHeight

NormTopLeftCorner_h = PixelTopLeftCorner_h / (PixelPictureWidth * DAR)

NormFractionalOccupancy = PixelFractionalOccupancy
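
The normalization step can be sketched as a small function. This is an illustration under stated assumptions rather than a definitive implementation: the struct names PixelBox and NormBox and the function normalizeBox are hypothetical, and the sketch assumes that both horizontal quantities (width and horizontal offset) are divided by the picture width in pixels times DAR, while both vertical quantities are divided by the picture height in pixels.

```cpp
#include <cassert>

// Hypothetical holder for the binding box estimated in pixel units.
struct PixelBox {
    int topLeftH, topLeftV, boxWidth, boxHeight;
    double occupancy;
};

// Hypothetical holder for the normalized feature vector.
struct NormBox {
    double topLeftH, topLeftV, boxWidth, boxHeight, occupancy;
};

// dar is the display aspect ratio as defined in the text:
// displayed height divided by displayed width (e.g. 0.75 for 4:3 TV).
NormBox normalizeBox(const PixelBox& p, int pictureWidth, int pictureHeight,
                     double dar) {
    assert(pictureWidth > 0 && pictureHeight > 0 && dar > 0.0);
    // Number of horizontal pixels that span one display height.
    const double hScale = pictureWidth * dar;
    NormBox n;
    n.topLeftH = p.topLeftH / hScale;
    n.topLeftV = static_cast<double>(p.topLeftV) / pictureHeight;
    n.boxWidth = p.boxWidth / hScale;
    n.boxHeight = static_cast<double>(p.boxHeight) / pictureHeight;
    n.occupancy = p.occupancy;  // a ratio of sample counts, already unitless
    return n;
}
```

For a 320x240 picture with DAR 3/4, a box covering the whole picture normalizes to a width of 320/(320*0.75) = 1.333 and a height of 1.0, in agreement with the earlier 320x240 example.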

Once the feature vectors are available for each visual object in each
image of the database, it is straightforward to perform a matching/query
process based on the queries listed above. A search engine shown in Fig.
4 compares the query number to the appropriate element of the feature
vectors, and performs sorting to pick the best matches. In these searches,
the search engine needs additional metadata: the display aspect ratio, and the
width and height in pixels.

Here details are provided for the particular query "Find the visual
objects that have an aspect ratio (ratio of height to width) of A". In
response to this query, the search engine:

1. Computes the aspect ratios (a_i) of all the visual objects in the
database (i.e. ratio of BoxHeight to BoxWidth).

2. Computes the Euclidean distance d_i from a_i to A for each i. Other
distance metrics are also possible.

    d_i = |A - a_i|

3. Sorts d_i in descending order.

4. Presents the top results in the sorting to the user who made the
query.

The search engine can pre-compute a lot of information to speed up
the search.
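
The aspect ratio query can be sketched as follows. The struct Entry and the function rankByAspectRatio are hypothetical names, and the database is assumed to be an in-memory list of object identifiers with their normalized BoxWidth and BoxHeight. The sketch places the smallest distances first, on the assumption that the objects whose aspect ratios are closest to A are the intended top results.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical database record: an object identifier plus the two
// feature vector elements needed for the aspect ratio query.
struct Entry {
    int objectId;
    double boxWidth;
    double boxHeight;
};

// Returns up to topN object identifiers ordered by d_i = |A - a_i|,
// smallest distance first (an assumption of this sketch).
std::vector<int> rankByAspectRatio(const std::vector<Entry>& db, double A,
                                   std::size_t topN) {
    struct Scored {
        double d;
        int id;
    };
    std::vector<Scored> scored;
    scored.reserve(db.size());
    for (const Entry& e : db) {
        assert(e.boxWidth > 0.0);
        const double a = e.boxHeight / e.boxWidth;          // step 1
        scored.push_back({std::fabs(A - a), e.objectId});   // step 2
    }
    // Step 3: order the distances; ties keep their database order.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const Scored& x, const Scored& y) { return x.d < y.d; });
    // Step 4: keep the identifiers at the top of the list.
    std::vector<int> ids;
    for (std::size_t k = 0; k < scored.size() && k < topN; ++k) {
        ids.push_back(scored[k].id);
    }
    return ids;
}
```

Because the aspect ratios depend only on the stored feature vectors, they could be computed once and cached, which is one example of the pre-computation mentioned above.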

Thus the present invention provides a coarse representation of a 
visual object's shape for search/query/filtering applications by 
representing each object by a binding box. 



CLAIM OR CLAIMS

WHAT IS CLAIMED IS:

1. A method of coarse representation of the shape of a visible object in a
digital picture comprising the steps of:

segmenting visible objects from the digital picture;

extracting a bitmap for an object of interest from the segmented
visible objects; and

estimating from the bitmap and a display aspect ratio a binding box
for the object of interest.

2. The method as recited in claim 1 wherein the estimating step comprises
the steps of:

estimating in pixel units a set of parameters for the binding box; and

normalizing the pixel units to form a feature vector representing the
binding box.

3. The method as recited in claim 2 further comprising the step of
searching a video database having visible objects, each visible object
having an associated feature vector, to find those visible objects whose
feature vectors match the feature vector of the object of interest.

4. The method as recited in claim 3 wherein the searching step comprises
the steps of:

computing aspect ratios for all visible objects in the video database; 

computing distances according to a specified distance metric 
between the desired aspect ratio and the aspect ratios for the visible 
objects in the video database; 

sorting the distances in descending order to produce a sort list of 
aspect ratios and associated visible objects; and 

displaying the visible objects associated with the aspect ratios that
are at the top of the sort list.

ABSTRACT OF THE DISCLOSURE



A method of coarse representation of a visual object's shape for 
search/query/filtering applications uses a binding box that fully 
encompasses the object of interest within the image to extract a feature 
vector. Once the feature vector is available, matching based on specific 
queries may be performed using a search engine to compare the query 
number to an appropriate element of the feature vector, performing sorting 
to pick the best matches.